May 29, 2015

Background

HUM 110

All Reed freshmen are required to take Humanities 110.

Statistics at Reed

  • Statistics within the math department
  • Classes:
    • Intro stats
    • Year-long junior-level probability and mathematical statistics sequence
    • Applied/methods classes in other departments
  • New class: MATH241 Case Studies in Statistical Analysis

Data Science

Class Description

Class Structure

Prereqs

Only intro stats and some exposure to R

Syllabus

  • 5 biweekly mini-reports submitted in R Markdown: reproducible research
  • Term project: a written report and a 20-minute oral presentation
  • In-class participation

Classroom

Demographics

18 students, mostly juniors and seniors.

Major                                                               Count
Mathematics                                                         4
Biological Science: Biology & Biochem and Molecular Biology         4
Other Science: Chemistry, Environmental Studies, Physics            4
Social Science: Political Science, Sociology                        2
Economics                                                           2
Misc: Psychology, Linguistics                                       2

Principles

ASA's GAISE Reports

  • Use real data.
  • Stress conceptual understanding, rather than mere knowledge of procedures.
  • Foster active learning in the classroom.
  • Use technology for developing conceptual understanding and analyzing data.

In Practice

  • Messy data needing cleaning, from potentially disparate sources
  • Bottom-up: Let questions/data motivate the statistical methodology, rather than vice-versa
  • Discussions in class
  • Focus on the entire analysis pipeline: article in Nature
  • Lean on R heavily

Tools

Environment: RStudio

How to get students to use R?

  • Key: Forget base R
  • How? The Hadleyverse of packages by Hadley Wickham.
  • In particular
    • dplyr package for data wrangling/manipulation
    • ggplot2 package for data visualization

dplyr Package

Features

  • Data manipulation is performed using verbs
  • The pipe operator %>%. For example, to apply the functions h(), then g(), then f() to data x, you can write either
    • f(g(h(x))) or
    • h(x) %>% g() %>% f()
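
The equivalence of the two forms can be checked with a small sketch. The functions sqrt(), mean(), and round() and the vector x below are stand-ins chosen for illustration; they are not from the class materials.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

x <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
identical(nested, piped)
```

The piped form tends to be easier for newcomers to read because the order of the code matches the order of the operations.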

Example: Houston Flights

Info on all domestic flights leaving Houston (IAH) in 2011:

  • flights: info on all 227,496 flights
  • planes: information on all 2,853 airplanes

What are the top 5 carriers using the oldest planes (averaged over all flights)?

Flights

The flights dataset:

date        dep   arr   carrier  flight  dest  plane
2011-01-01  1400  1500  AA       428     DFW   N576AA
2011-01-02  1401  1501  AA       428     DFW   N557AA
2011-01-03  1352  1502  AA       428     DFW   N541AA
2011-01-04  1403  1513  AA       428     DFW   N403AA
2011-01-05  1405  1507  AA       428     DFW   N492AA

Planes

The planes dataset:

plane   year  model           mfr                no.seats
N576AA  1991  DC-9-82(MD-82)  MCDONNELL DOUGLAS  172
N557AA  1993  KITFOX IV       MARZ BARRY         2
N403AA  1974  S55A            RAVEN              1
N492AA  1989  DC-9-82(MD-82)  MCDONNELL DOUGLAS  172
N262AA  1985  DC-9-82(MD-82)  MCDONNELL DOUGLAS  172

Example: Age of Planes

The following sequence of verbs wrangles/manipulates the data:

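library(dplyr)

left_join(flights, planes, by = "plane") %>%  # attach plane info to each flight
  select(carrier, plane, year) %>%
  mutate(age = 2011 - year) %>%               # plane age as of 2011
  group_by(carrier) %>%
  summarise(avg_age = mean(age)) %>%          # average over all flights
  arrange(desc(avg_age)) %>%
  top_n(5, avg_age)                           # keep the 5 largest averages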

carrier avg_age
MQ 29.421
AA 24.325
DL 20.760
US 19.078
UA 14.635
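
One detail of this pipeline is worth a closer look: a left join keeps every flight, so a plane with no match in planes gets a missing year. The sketch below rebuilds toy versions of the two tables from only the rows shown on the slides (not the full datasets) and averages plane age for the one carrier present. Using na.rm = TRUE is an assumption made here; the slides do not say how missing years were handled.

```r
library(dplyr)

# Toy data: only the rows shown on the slides, not the full datasets
flights <- data.frame(
  carrier = rep("AA", 5),
  plane   = c("N576AA", "N557AA", "N541AA", "N403AA", "N492AA"),
  stringsAsFactors = FALSE
)
planes <- data.frame(
  plane = c("N576AA", "N557AA", "N403AA", "N492AA", "N262AA"),
  year  = c(1991, 1993, 1974, 1989, 1985),
  stringsAsFactors = FALSE
)

# left_join() keeps all flights; N541AA has no match, so its year is NA
joined <- left_join(flights, planes, by = "plane")
n_missing <- sum(is.na(joined$year))

avg_age <- joined %>%
  mutate(age = 2011 - year) %>%
  group_by(carrier) %>%
  # Assumption: drop flights whose plane year is unknown
  summarise(avg_age = mean(age, na.rm = TRUE))

n_missing
avg_age
```

Without na.rm = TRUE, a single unmatched plane would turn the carrier's average into NA, so the handling of missing values is a choice that has to be made explicitly.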

ggplot2: the Grammar of Graphics

     

A statistical graphic consists of a mapping of data variables to aesthetic attributes of geometric objects that we can observe.

ggplot2 allows us to construct graphics in a modular fashion by specifying these components.
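
As a minimal sketch of this modular construction, the example below uses R's built-in mtcars data (an illustrative choice, not a dataset from the class). Each variable is mapped to an aesthetic, and a geometric object is then added as a separate layer.

```r
library(ggplot2)

# Data: the built-in mtcars data frame (illustrative only)
# Aesthetics: weight -> x position, mpg -> y position, cylinders -> color
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()  # Geometric object: points

# Components are added with +, so layers can be swapped or stacked
p_labeled <- p + labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
                      color = "Cylinders")

print(p_labeled)
```

Changing geom_point() to another geom, or remapping a variable to a different aesthetic, alters the graphic without rewriting the rest of the specification.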

ggplot2: the Grammar of Graphics

Data (Variable)  Aesthetic               Geometric Object
longitude        x position              points
latitude         y position              points
army size        size = width            bars
army direction   color = brown or black  bars
date             (x,y) position          text
temperature      (x,y) position          lines

Results

Delayed Flights

Age of Airplanes

Dataset: OkCupid Data

  • Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=5995\))
  • 40.2% of the sample was female
  • Use logistic regression to predict gender
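
A logistic regression of this kind can be sketched in base R with glm(). The data below are simulated purely for illustration and are not the OkCupid sample; the variable names is_female and height, and the use of height as a predictor, are assumptions made for this sketch.

```r
set.seed(1)

# Simulated stand-in data (NOT the OkCupid sample)
n <- 500
is_female <- rbinom(n, size = 1, prob = 0.4)
# Hypothetical predictor: height in inches, shifted by gender
height <- ifelse(is_female == 1,
                 rnorm(n, mean = 64, sd = 3),
                 rnorm(n, mean = 70, sd = 3))
profiles <- data.frame(is_female, height)

# Logistic regression: model the log-odds of being female as linear in height
fit <- glm(is_female ~ height, data = profiles, family = binomial)

# Fitted probabilities, and a simple 0.5 cutoff for classification
p_hat <- predict(fit, type = "response")
predicted_female <- as.integer(p_hat > 0.5)

# Share of profiles classified correctly
accuracy <- mean(predicted_female == profiles$is_female)
coef(fit)
```

In this simulated setup the height coefficient is negative, meaning taller profiles are predicted to be less likely female. With real data the same fitting code applies, but the predictors and the cutoff would need to be chosen and evaluated with care.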

Job

Self-Referenced Body Type

The best predictors are those whose values differ markedly by gender across large segments of the population.

Dataset: Reed Jukebox

All 222,540 songs played on the Reed pool hall jukebox from 2003 to 2009, courtesy of Noah Pepper '09

date_time artist album track
Sun Dec 7 05:12:57 2003 Tom Petty and the Heartbreakers Into the Great Wide Open
Sun Dec 7 05:15:56 2003 Jefferson Airplane Somebody To Love
Sun Dec 7 05:23:04 2003 Led Zeppelin Led Zeppelin IV 08 When The Levee Breaks

Artist Popularity

Time Series

Maps

Interactive Shiny Apps

The Future

Statistics' Image Problem

  • You hear this a lot:
    • Statistician: Hi, I'm a statistician.
    • Non-statistician: Statistics? I hated that class!
  • You'll never hear this:
    • Statistician: Hi, my work involves a lot of data visualization.
    • Non-statistician: Data visualization? I hate that stuff!

Solution: Data Visualization

  • Data visualization is a backdoor way to get students interested in statistics.
  • Prez from Season 4 of "The Wire":

Impact on my Intro Stats Classes

This is the only stats class many will take.

  • Get them looking at, manipulating, and visualizing data QUICKLY
  • Better integration of lectures and labs
  • Flipped classroom
    • Lab exercises at home
    • Problem solving/debugging and discussion in class

Conclusions

Take Home Messages

  • A class focused on the data first, methods second, using only open-source tools.
  • Rich Majerus wrote "Why should students at a small liberal arts college learn R?"
  • New tools like DataCamp are increasing the ratio: \[\frac{\text{Payoff from learning R}}{\text{Startup costs}}\]
  • Data visualization is a gateway drug for statistics
  • Developing skills and intuition takes time. Instructor attention and feedback are crucial.
  • Interactivity boosts student interest

Resources